<html>
<head>
<link rel=stylesheet href="style.css" type="text/css">
<title>A Tutorial - Getting Started With Collectl</title>
</head>

<body>

<body>
<center>
<h1>Tutorial - The Basics</h1>
</center>

<p>
Getting started using collectl may seem a little challenging to the new user because
of its many options, but it shouldn't be.  After all, how many people simply run the
<i>top</i> command and don't even realize there are a rich set of options available?
In that same spirit, you can simply enter the <i>collectl</i> command and get a lot
of useful information, but you would also be losing out on a lot.  The intent of
this tutorial is to give you a better appreciation of what you can do with collectl
and hopefully encourage you to experiment with even more options than those described
below.
<p>
<h3>Measuring Disk Activity</h3>
For this first set of examples I'll be using Robin Miller's 
<a href=http://www.scsifaq.org/RMiller_Tools/index.html>dt</a> to write a large file
to /tmp using the command <i>dt of=/tmp/test limit=1g bs=1m disable=compare,verify dispose=keep</i> while running collectl in another window:

<div class=terminal>
<pre>
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  30  30   254     65      8      2   7920     97      0      4      0       2
  10  10   377     65      0      0  32500    282      4     52      2      19
  10  10   332     61      0      0  29312    246      0      3      0       3
   9   9   330     65      0      0  32512    275      3     45      1       9
  11  11   331     53      4      1  29684    270      0      2      0       2
   8   8   352     63      0      0  35004    273      3     33      1       8
  13  12   329    116      0      0  28924    249      0      2      0       2
</pre>
</div>

Here we see a few things including a burst of cpu activity when the test first
starts as well as an I/O rate of about 30MB/sec which corresponds to what dt is
telling us in the following summary line:

<pre>
Average transfer rates: 32051995 bytes/sec, 31300.776 Kbytes/sec
</pre>

If we compare the write rates to the number of writes we can also infer writes 
of about 128KB which is good to know because that means we're being efficient 
in the size of the data blocks being handed to the driver.
<p>
This may also be a good time to mention screen real estate.  There is a lot of information
that collectl can display and everything takes space!  More often than not you don't 
really care about time and so by default it isn't displayed.  However there may be times
you do care and so you can simply add the switch -oT add the option of time to the display.
In fact, sometimes you may want to include the date as well in which case -oD will do
both.  You can even show the times in msec by including <i>m</i> with -o, which can be
useful when running at sub-second monitoring levels and/or if you want to correlate data
to system or application logs with may themselves have finer grained time.
Here's an example of the command <i>collectl -scd -i.25 -oDm</i>
which shows the cpu and disk loads every quarter second and includes the date and time in msecs:

<div class=terminal>
<pre>
#                      <--------CPU--------><----------Disks----------->
#Date    Time          cpu sys inter  ctxsw KBRead  Reads KBWrit Writes
20080212 11:22:47.008    2   0   364     84      0      0  31328    284
20080212 11:22:47.258    8   6   392     92      0      0  30832    356
20080212 11:22:47.508    8   6   308     84      0      0  36256    268
20080212 11:22:47.758    2   0   292     44      0      0  31152    196
</pre>
</div>

<p>
So what about that CPU load?  Given that this is a 2 CPU system we might be
interested in seeing how that load is being distributed by running the command
<i>collectl -sC</i>, since an uppercase subsystem type, like cpu, disk
or network tells collectl to show instance level details:

<div class=terminal>
<pre>
# SINGLE CPU STATISTICS
#   CPU  USER NICE  SYS WAIT IRQ  SOFT STEAL IDLE
      0     0    0   17    0    0    0     0   83
      1     0    0    4    0    0    0     0   96
      0     0    0   14    0    0    0     0   86
      1     0    0    0    0    0    0     0  100
      0     0    0   20    0    0    0     0   80
      1     0    0    0    0    0    0     0  100
</pre>
</div>

noting all the load is being delivered by a single CPU as expected.  Ok, so now
let's read back the 1G file we just wrote and see what happens.

<div class=terminal>
<pre>
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
  38  37   248    189   7283    111      0      0      1      9      1       8
  24  23   153     81     32      0      0      0      2     32      1       9
</pre>
</div>

Now we see a big burst of CPU load and not much from disk.  Furthermore dt
is reporting

<pre>
Average transfer rates: 872960833 bytes/sec, 852500.813 Kbytes/sec
</pre>

which in fact confirms that reads are coming from cache and not disk since
no local disk can read at this rate!  In general, when doing disk I/O testing one
should use file sizes that are larger than cache to force all I/O to come 
from disk.  So repeating the tests with a larger file we now see more 
realistic read rates:

<div class=terminal>
<pre>
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
   9   8   773    743  41376    629      0      0      1      8      1       7
   9   8   619    639  31716    476      0      0      2     33      1       8
  16  15   510    554  23016    370      0      0      0      4      0       2
  10  10   572    624  27272    429      0      0      2     27      1       8
  16  15   458    504  19560    306     12      2      0      4      0       2
</pre>
</div>

So just what is happening to cache during testing?  To see memory utilization
we can simply add the memory subsystem to the default selections as
<i>collectl -s+m</i> but that also makes the display wider and since for our 
purposes we don't need network information I'm just going to run the following
<i>collectl -scmd</i>:

<div class=terminal>
<pre>
#<--------CPU--------><-----------Memory----------><----------Disks----------->
#cpu sys inter  ctxsw free buff cach inac slab  map KBRead  Reads KBWrit Writes
   3   0   159     80   2G 395M 189M   1M    0    0      0      0     20      3
   1   0   153     52   2G 395M 189M   1M    0    0      0      0      0      0
  43  42   238     68   2G 395M 340M 152M    0    0      0      0   3060     72
  25  25   376     53   1G 395M 431M 242M    0    0      0      0  29808    273
   6   6   377     59   1G 395M 455M 266M    0    0      0      0  30900    266
  10  10   347     55   1G 395M 492M 303M    0    0      0      0  35004    265
   5   4   389     60   1G 395M 506M 318M    0    0      0      0  27308    262
</pre>
</div>

and watch the cache fill up.  In fact, if we keep running collectl eventually
we use up all available memory (but that's what it's there for) and even after
the test completes and there is no more I/O, we still see hardly any free
memory.  But that too is ok because until someone else needs it or deletes
the file, that data stays in cache.  Look at the last sample where I manually
deleted the file.  You can see the cache drop to 204M and the free memory 
rise to 2G during a single reporting interval:

<div class=terminal>
<pre>
#<--------CPU--------><-----------Memory----------><----------Disks----------->
#cpu sys inter  ctxsw free buff cach inac slab  map KBRead  Reads KBWrit Writes
   1   1   374     91 171M 397M   2G   2G    0    0      0      0  34624    288
   1   1   368     82 171M 397M   2G   2G    0    0      0      0  31408    260
   2   2   319     56 171M 397M   2G   2G    0    0      0      0  31148    266
   0   0   385     70 172M 397M   2G   2G    0    0      0      0  25844    273
   0   0   167     70 172M 397M   2G   2G    0    0      0      0      0      0
   0   0   173     51 172M 397M   2G   2G    0    0      0      0      0      0
   2   0   181    108 172M 397M   2G   2G    0    0      0      0     12      2
  41  41   148     52   2G 397M 204M  15M    0    0      0      0     72      5
</pre>
</div>

For one more test, I'm going to write that same 1G file to my home directory and
look what collectl tells me:

<div class=terminal>
<pre>
#<--------CPU--------><----------Disks-----------><----------Network---------->
#cpu sys inter  ctxsw KBRead  Reads KBWrit Writes netKBi pkt-in netKBo pkt-out
   0   0   145     48      0      0      0      0      2     38      2      13
  13  13  6716   3491      0      0      0      0    136    682  21144   14762
  18  18  6802   3426      0      0      0      0    248   1256  39111   27278
  14  14  4680   2420      0      0     28      2    252   1256  40166   28008
   7   7  3105   1520      0      0      0      0    148    752  23256   16228
</pre>
</div>

Since my home directory is mounted via nfs, all I/O goes through the network!  In
fact, if I run collectl as <i>collectl -scfn</i> I see:

<div class=terminal>
<pre>
#<--------CPU--------><----------Network----------><--NFS Svr Summary-->
#cpu sys inter  ctxsw netKBi pkt-in netKBo pkt-out   read  write  calls
   0   0   144     48      0      2      0       2      0      0      0
  13  12  2839   1440      5     79      1      10      0      0      0
  17  16  6539   3061    216   1081  33976   23747      0      0      0
  15  14  6887   2704    253   1275  40020   27876      0      0      0
  12  11  4683   2209    252   1256  40166   28008      0      0      0
</pre>
</div>

I can clearly see the network traffic but why don't I see any nfs activity?
That's because by default collectl shows <i>server</i> activity rather
than the client and so we need to add the option -OC to the command which
now looks like <i>collectl -scfn -OC</i> and produces the following:

<div class=terminal>
<pre>
#<--------CPU--------><----------Network----------><--NFS Clt Summary-->
#cpu sys inter  ctxsw netKBi pkt-in netKBo pkt-out   read  write  calls
   2   1   142     44      0      5      0       3      0      0      0
   3   2  1562    693     46    253   6348    4456      0    256    285
  23  22  6735   3026    251   1250  40166   28007      0   1216   1255
  22  22  6633   3309    241   1220  38053   26541      0   1216   1253
  20  19  6657   3338    245   1223  39109   27271      0   1204   1242
</pre>
</div>

So what is the difference between <i>NFS Writes</i> and <i>NFS Calls</i>?
To best demonstrate this let's add <i>F</i> to the subsystems which like
other uppercase subsystems indicates <i>details</i> and in the case of nfs
this translates to breaking things down by metadata operations, commits and 
retransmissions:

<div class=terminal-wide14>
<pre>
#<--------CPU--------><----------Network----------><--NFS Clt Summary--><----NFS MetaOps---->
#cpu sys inter  ctxsw netKBi pkt-in netKBo pkt-out   read  write  calls   meta commit retran
  13  13  3625   1462    128    662  20114   14010      0    640    660      0     20      0
   9   9  3764   1165    132    662  21110   14742      0    672    693      0     21      0
</pre>
</div>

and we can now see those extra calls are in fact <i>nfs commits</i>.
<p>
So in conclusion you can see there is really quite a lot you can do with just a few basic
switches and I haven't even gotten into <i>--verbose</i>, which as they say is an exercise left for
the student.  So try some simple dt tests yourself or use you own personal favorite
load generator, while trying out <i>collectl -sc --verbose</i> or <i>collectl -sm --verbose</i> or even
<i>collectl -sn --verbose</i>.  You can even put them all together as 
<i>collectl -scmn --verbose</i>, but then as you'll see you end up using a lot of that valuable
screen real estate.  As a final bonus, try adding the -ot switch which puts collectl into something
I loosely call <i>top</i> format since each sample is displayed at the top of the screen.  That command
would then look like <i>collectl -smcn --verbose -ot</i>.
<p>
enjoy...

</body>
</html>
